feat(raft): Implement Raft-based Consistent Hash State Management #636

Merged
mostafa merged 26 commits into main from feature/setup-raft on Dec 14, 2024


sinadarbouy
Collaborator

@sinadarbouy sinadarbouy commented Nov 29, 2024

Ticket(s)

Implements part of #628 - Raft-based State Synchronization for GatewayD Instances

Description

This PR introduces Raft consensus for managing consistent hash state across GatewayD instances. This is the first phase of implementing Raft-based state synchronization, focusing specifically on the consistent hash load balancer state.

Key changes include:

  • Added Raft implementation using HashiCorp's Raft library
  • Implemented gRPC communication between Raft nodes so that follower nodes can forward apply requests to the leader
  • Integrated BoltDB for Raft log persistence
  • Modified consistent hash implementation to use Raft for state synchronization
  • Added configuration options for Raft (node ID, address, peers, etc.)
  • Implemented leadership monitoring and peer management
  • Added tests for Raft integration with consistent hash

The implementation ensures that consistent hash mappings are synchronized across all GatewayD instances, providing better consistency in load balancing decisions across the cluster.

Related PRs

N/A - This is the first PR implementing Raft consensus

  • I have added a descriptive title to this PR.
  • I have squashed related commits together.
  • I have rebased my branch on top of the latest main branch.
  • I have performed a self-review of my own code.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added docstring(s) to my code.
  • I have made corresponding changes to the documentation (docs).
  • I have updated docs using make gen-docs command.
  • I have added tests for my changes.
  • I have signed all the commits.

Legal Checklist

This commit introduces Raft consensus to maintain consistency of hash-to-proxy
mappings across multiple GatewayD instances. Key changes include:

- Add new Raft package implementing consensus protocol using HashiCorp's Raft
- Integrate Raft with consistent hashing load balancer
- Store proxy mappings in distributed state machine
- Add configuration options for Raft cluster setup
- Implement leadership monitoring and peer management
- Add FSM snapshot and restore capabilities

The implementation ensures that hash-to-proxy mappings remain consistent across
cluster nodes, improving reliability for consistent hash-based load balancing.
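As a rough illustration of the state machine these bullets describe, the sketch below keeps hash-to-proxy mappings in a map, applies replicated commands to it, and supports snapshot and restore. It is a standard-library stand-in rather than the PR's code: the real implementation plugs into HashiCorp Raft's FSM interface, and the `Command` fields, the `"add"` command type, and the `HashFSM` name are assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Command is a hypothetical replicated log entry for the hash map.
type Command struct {
	Type      string `json:"type"`
	Hash      string `json:"hash"`
	ProxyName string `json:"proxyName"`
}

// HashFSM is a simplified state machine holding hash-to-proxy mappings.
type HashFSM struct {
	mu       sync.RWMutex
	mappings map[string]string
}

// NewHashFSM returns an empty state machine.
func NewHashFSM() *HashFSM {
	return &HashFSM{mappings: make(map[string]string)}
}

// Apply decodes one committed command and mutates the local state.
func (f *HashFSM) Apply(data []byte) error {
	var cmd Command
	if err := json.Unmarshal(data, &cmd); err != nil {
		return fmt.Errorf("decode command: %w", err)
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	switch cmd.Type {
	case "add":
		f.mappings[cmd.Hash] = cmd.ProxyName
		return nil
	default:
		return fmt.Errorf("unknown command type %q", cmd.Type)
	}
}

// Lookup returns the proxy name stored for a hash, if any.
func (f *HashFSM) Lookup(hash string) (string, bool) {
	f.mu.RLock()
	defer f.mu.RUnlock()
	name, ok := f.mappings[hash]
	return name, ok
}

// Snapshot serializes the whole map so a node can catch up without replaying the log.
func (f *HashFSM) Snapshot() ([]byte, error) {
	f.mu.RLock()
	defer f.mu.RUnlock()
	return json.Marshal(f.mappings)
}

// Restore replaces the local state with a previously taken snapshot.
func (f *HashFSM) Restore(snapshot []byte) error {
	restored := make(map[string]string)
	if err := json.Unmarshal(snapshot, &restored); err != nil {
		return fmt.Errorf("decode snapshot: %w", err)
	}
	if restored == nil { // a JSON null snapshot yields a nil map
		restored = make(map[string]string)
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	f.mappings = restored
	return nil
}
```

Because every node applies the same committed commands in the same order, each copy of the map should converge to the same contents; the snapshot path only shortens recovery.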
…y mapping

- Replace proxy ID with block name for consistent hash mapping
- Remove direct raft node dependency from ConsistentHash struct
- Add ProxyByBlock map to Server for block-based proxy lookups
- Include group name in hash key generation for better distribution
- Add proxy initialization during server startup
- Update FSM to use consistent naming for hash map storage

This change improves the consistent hashing mechanism by using block names
instead of proxy IDs, making it more aligned with the block-based
architecture while maintaining backwards compatibility with the original
load balancing strategy.
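The block-based mapping and the group-aware hash key can be sketched as follows. This is only an illustration: the FNV-1a hash, the zero-byte separator, and the `hashKey` and `blockFor` names are assumptions for this example and are not taken from the PR.

```go
package main

import "hash/fnv"

// hashKey derives a stable key from the group name and a client identifier.
// Including the group name keeps identical clients in different groups apart.
func hashKey(groupName, clientID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(groupName))
	h.Write([]byte{0}) // separator, so ("ab", "c") and ("a", "bc") hash differently
	h.Write([]byte(clientID))
	return h.Sum64()
}

// blockFor returns the block name mapped to the key. On first sight of a key
// it stores and returns the candidate block, so later lookups stay sticky.
func blockFor(mappings map[uint64]string, groupName, clientID, candidate string) string {
	key := hashKey(groupName, clientID)
	if block, ok := mappings[key]; ok {
		return block
	}
	mappings[key] = candidate
	return candidate
}
```

Storing a block name rather than a per-process proxy ID means the stored value still makes sense on another instance, which is what allows the mapping to be replicated.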
- Remove unused UUID-based ID field from Proxy struct
- Remove GetID() method from IProxy interface and Proxy implementation
- Remove GetProxyByID() method from Server struct
- Remove uuid package dependency

The proxy ID was not being used meaningfully in the codebase, so removing it
simplifies the proxy implementation.

This commit introduces comprehensive Raft testing infrastructure and enhances
the consistent hash implementation with distributed state management.

Key changes:
- Add new test cases for Raft leadership, follower behavior, and FSM operations
- Integrate Raft with consistent hash load balancer for distributed state
- Add TestRaftHelper utility for simplified Raft testing setup
- Update consistent hash tests to use Raft for state persistence
- Add GetState method to RaftNode for state inspection
- Improve test coverage for concurrent operations

The changes ensure that proxy mappings are consistently maintained across
the cluster using Raft consensus, making the load balancer more reliable
in distributed environments.

- Add Directory field to Raft config to make raft storage location configurable
- Use t.TempDir() in tests to ensure proper cleanup of test directories
- Rename HashMapCommand to ConsistentHashCommand for better clarity
- Update command type constants and map names to be more descriptive
- Fix test flakiness by using unique node IDs and random available ports
- Remove manual directory cleanup in favor of t.TempDir() cleanup
- Update configuration files with raft directory settings

This change improves test stability and makes the raft storage location
configurable while cleaning up naming conventions throughout the raft package.

Add default configuration values for Raft consensus implementation:
- RaftAddress: 127.0.0.1:2223
- RaftNodeID: node1
- RaftLeaderID: node1
- RaftDirectory: raft

This change initializes the default Raft configuration in the config loader.
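A minimal sketch of how such defaults might be applied is shown below. The default values are the ones listed above; the `RaftConfig` struct, its field names, and the `withDefaults` helper are hypothetical and do not reflect the actual config loader.

```go
package main

// Default values as listed in the commit message.
const (
	DefaultRaftAddress   = "127.0.0.1:2223"
	DefaultRaftNodeID    = "node1"
	DefaultRaftLeaderID  = "node1"
	DefaultRaftDirectory = "raft"
)

// RaftConfig is a hypothetical subset of the Raft settings.
type RaftConfig struct {
	Address   string
	NodeID    string
	LeaderID  string
	Directory string
}

// withDefaults fills every empty field with its default and leaves
// explicitly configured values untouched.
func withDefaults(cfg RaftConfig) RaftConfig {
	if cfg.Address == "" {
		cfg.Address = DefaultRaftAddress
	}
	if cfg.NodeID == "" {
		cfg.NodeID = DefaultRaftNodeID
	}
	if cfg.LeaderID == "" {
		cfg.LeaderID = DefaultRaftLeaderID
	}
	if cfg.Directory == "" {
		cfg.Directory = DefaultRaftDirectory
	}
	return cfg
}
```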
- Enhance error handling with wrapped errors and detailed messages
- Add meaningful constants for timeouts and configuration values
- Rename RaftNode to Node for better clarity
- Fix JSON field names to match Raft convention (nodeId, leaderId)
- Add missing error checks in critical paths
- Improve documentation and code comments
- Update golangci linter settings to include raft package

- Introduced a temporary directory for Raft using t.TempDir() in the Test_pluginScaffoldCmd test case.
- Set the GATEWAYD_RAFT_DIRECTORY environment variable to the new temporary directory.
- This change ensures that Raft operations during testing are isolated and do not interfere with other tests or system directories.

github-actions bot commented Nov 29, 2024

Overview

Image reference | ghcr.io/gatewayd-io/gatewayd:f6aba9f | gatewaydio/gatewayd:latest
- digest | 3cb5a6a54232 | 80f3e87db481
- tag | f6aba9f | latest
- provenance | 7f47dca
- vulnerabilities | critical: 0 high: 1 medium: 1 low: 0 | critical: 0 high: 1 medium: 1 low: 0
- platform | linux/amd64 | linux/amd64
- size | 20 MB | 17 MB (-2.8 MB)
- packages | 143 | 131 (-12)
Base Image alpine:3
also known as:
3.20
3.20.3
latest
alpine:3.20
also known as:
3
3.20.3
latest
- vulnerabilities critical: 0 high: 0 medium: 1 low: 0 critical: 0 high: 0 medium: 1 low: 0
Packages and Vulnerabilities (13 package changes and 0 vulnerability changes)
  • ➖ 10 packages removed
  • ♾️ 3 packages changed
  • 125 packages unchanged
Changes for packages of type apk (3 changes)
Package | Version (ghcr.io/gatewayd-io/gatewayd:f6aba9f) | Version (gatewaydio/gatewayd:latest)
ca-certificates 20240705-r0
openssl 3.3.2-r0
pax-utils 1.3.7-r2
Changes for packages of type golang (10 changes)
Package | Version (ghcr.io/gatewayd-io/gatewayd:f6aba9f) | Version (gatewaydio/gatewayd:latest)
github.com/armon/go-metrics 0.4.1
github.com/boltdb/bolt 1.3.1
♾️ github.com/gatewayd-io/gatewayd (devel) 0.0.0-20241109120212-7f47dca74c26
github.com/hashicorp/go-immutable-radix 1.0.0
github.com/hashicorp/go-msgpack/v2 2.1.2
github.com/hashicorp/golang-lru 0.5.1
github.com/hashicorp/raft 1.7.1
github.com/hashicorp/raft-boltdb 0.0.0-20231211162105-6c830fa4535e
♾️ google.golang.org/protobuf 1.35.2 1.35.1
♾️ stdlib go1.23.4 1.23.3

- Replace loadEnvVars with loadEnvVarsWithTransform to handle complex env values
- Add special handling for raft.peers to parse JSON array into RaftPeer structs
- Update GlobalKoanf and PluginKoanf to use new transformer function

This change allows proper parsing of list-type environment variables,
specifically for raft peer configurations.
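The special handling for `raft.peers` can be pictured with the sketch below, which decodes a JSON array taken from an environment value into peer structs. The `RaftPeer` fields, the JSON keys, and the `parsePeers` name are assumptions for illustration; the PR's transformer is wired into koanf and is not reproduced here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// RaftPeer is a hypothetical peer entry; the JSON keys are assumed.
type RaftPeer struct {
	ID      string `json:"id"`
	Address string `json:"address"`
}

// parsePeers turns the raw environment value into a slice of peers.
// An empty value means "no peers" rather than an error.
func parsePeers(raw string) ([]RaftPeer, error) {
	if strings.TrimSpace(raw) == "" {
		return nil, nil
	}
	var peers []RaftPeer
	if err := json.Unmarshal([]byte(raw), &peers); err != nil {
		return nil, fmt.Errorf("parse raft peers: %w", err)
	}
	return peers, nil
}
```

A plain key/value environment loader would keep the whole array as one string, which is why a list-typed setting needs its own decoding step.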
Add gRPC support to the Raft implementation to enable proper request forwarding between nodes. Changes include:

- Add protobuf definitions for Raft service with ForwardApply RPC
- Add gRPC server and client implementations for Raft nodes
- Update Raft configuration to include gRPC addresses
- Implement request forwarding logic for non-leader nodes
- Update node configuration to handle gRPC connections
- Add proper cleanup of gRPC resources during shutdown

The changes enable proper forwarding of apply requests from follower nodes to the leader, improving the distributed consensus mechanism.
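The forwarding rule described here (apply locally on the leader, otherwise hand the request to the leader) can be sketched without gRPC as follows. The `Forwarder` interface, `ForwarderFunc` adapter, and `Node` fields are hypothetical stand-ins; in the PR the forwarding goes through the `ForwardApply` RPC.

```go
package main

import "errors"

// ErrNoLeader is returned when a follower has nowhere to forward a request.
var ErrNoLeader = errors.New("no known leader to forward to")

// Forwarder abstracts the client used to reach the leader (gRPC in the PR).
type Forwarder interface {
	ForwardApply(data []byte) error
}

// ForwarderFunc adapts a plain function to the Forwarder interface.
type ForwarderFunc func(data []byte) error

// ForwardApply calls the wrapped function.
func (f ForwarderFunc) ForwardApply(data []byte) error { return f(data) }

// Node is a simplified Raft participant.
type Node struct {
	IsLeader   bool
	ApplyLocal func(data []byte) error // appends to the replicated log on the leader
	Leader     Forwarder               // connection to the current leader, if known
}

// Apply submits a command: the leader applies it itself, while any other
// node forwards it to the leader instead of failing.
func (n *Node) Apply(data []byte) error {
	if n.IsLeader {
		return n.ApplyLocal(data)
	}
	if n.Leader == nil {
		return ErrNoLeader
	}
	return n.Leader.ForwardApply(data)
}
```

With this shape, callers such as the load balancer do not need to know which node currently holds leadership.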
Add docker-compose-raft.yaml that configures a 3-node GatewayD cluster using Raft consensus protocol. The setup includes:
- 3 GatewayD nodes with Raft configuration
- Separate read/write PostgreSQL instances
- Redis for caching
- Observability stack (Prometheus, Tempo, Grafana)
- Plugin installation service

This configuration enables high availability and leader election through Raft consensus.

- Improve variable naming in loadEnvVarsWithTransform for better readability
- Clean up error handling in forwardToLeader and ForwardApply
- Add proper error propagation in RPC responses
- Fix string type conversions for peer IDs and addresses
- Organize imports and add missing error package
- Remove unused convertPeers function
- Add clarifying comments for Apply methods

This commit focuses on code quality improvements and better error handling
in the Raft implementation without changing core functionality.

- Implement `TestRPCServer_ForwardApply` to test the `ForwardApply` method of the RPC server, ensuring correct handling of apply requests with various configurations.
- Implement `TestRPCClient` to verify the creation and management of RPC clients, including client retrieval and connection closure.
- Utilize `setupGRPCServer` to create a gRPC server for testing purposes.
- Ensure proper setup and teardown of test nodes and gRPC connections to maintain test isolation and reliability.

- Change `nodeId` and `leaderId` from `node2` to `node1`.
- Add `grpcAddress` with value `127.0.0.1:50051`.
- Update `peers` to an empty list instead of an empty dictionary.

These changes adjust the Raft configuration to reflect the new node setup and include a gRPC address for communication.
The function `v1.NewStruct(args)` only accepts values that `NewValue` supports, so certain types must first be converted to strings. This change adds support for converting a slice of `config.RaftPeer` to a comma-separated string, with each peer formatted as "ID:Address:GRPCAddress". The conversion is needed so that the peers can be overridden through an environment variable.
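The conversion described above might look roughly like this. The "ID:Address:GRPCAddress" layout and the comma separator come from the commit message; the local `RaftPeer` struct and the `formatPeers` name are assumptions standing in for `config.RaftPeer` and the PR's actual helper.

```go
package main

import "strings"

// RaftPeer is a local stand-in for config.RaftPeer; field names are assumed.
type RaftPeer struct {
	ID          string
	Address     string
	GRPCAddress string
}

// formatPeers renders peers as "ID:Address:GRPCAddress" entries joined by commas.
func formatPeers(peers []RaftPeer) string {
	parts := make([]string, 0, len(peers))
	for _, p := range peers {
		parts = append(parts, strings.Join([]string{p.ID, p.Address, p.GRPCAddress}, ":"))
	}
	return strings.Join(parts, ",")
}
```

Note that each address already contains a colon, so a consumer of this format has to split on position rather than naively on every colon.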
- Updated the checksum value for the plugin configuration to ensure integrity and consistency with the latest changes.

- Replaced `LeaderID` with `IsBootstrap` in Raft configuration across multiple files.
- Updated YAML configuration files (`gatewayd.yaml`, `docker-compose-raft.yaml`) to reflect the new `IsBootstrap` flag.
- Modified Go source files (`config.go`, `constants.go`, `types.go`, `raft.go`) to use `IsBootstrap` instead of `LeaderID`.
- Adjusted test cases in `raft_test.go`, `rpc_test.go`, and `raft_helpers.go` to accommodate the new `IsBootstrap` flag.
- Ensured that the `IsBootstrap` flag is correctly set for nodes intended to bootstrap the Raft cluster.

- Added `t.Helper()` to `setupGRPCServer` and `setupNodes` functions to improve test helper identification.
- Corrected variable naming in `TestRPCServer_ForwardApply` for clarity and consistency.
- Ensured comments end with a period for consistency.
- Updated assertions to use `GetSuccess()` method for better readability.

- Updated Docker image references in `docker-compose-raft.yaml` to use `gatewaydio/gatewayd:latest` and added `pull_policy: always` for consistent image updates.
- Changed server and API addresses in `gatewayd.yaml` for better port management.
- Enhanced logging in `raft.go` by switching from `Info` to `Debug` for certain messages to reduce verbosity.
- Added detailed comments in `raft.go` and `rpc.go` to explain the purpose and functionality of key methods, improving code readability and maintainability.
- Introduced new helper functions with comments to clarify their roles in the Raft and RPC processes.

- Updated `createTestRedis` in `act_helpers_test.go` to use `wait.ForAll` for better reliability by ensuring both log readiness and port listening.
- Enhanced `Test_Run_Async_Redis` in `registry_test.go` by adding a context with a timeout to the consumer subscription for improved test robustness.
- Simplified the sleep duration in `Test_Run_Async_Redis` to reduce unnecessary wait time.
@sinadarbouy sinadarbouy marked this pull request as ready for review December 9, 2024 15:37
@sinadarbouy sinadarbouy requested a review from mostafa December 9, 2024 15:37
Member

@mostafa mostafa left a comment

Well done! ❤️ 🚀

@@ -62,7 +62,10 @@ func createTestRedis(t *testing.T) string {
req := testcontainers.ContainerRequest{
Image: "redis:6",
ExposedPorts: []string{"6379/tcp"},
- WaitingFor: wait.ForLog("Ready to accept connections"),
+ WaitingFor: wait.ForAll(
Member

Nice!

cmd/run.go (resolved)
cmd/testdata/gatewayd.yaml (outdated, resolved)
interval: 5s
timeout: 5s
retries: 5
gatewayd-1:
Member

I ❤️ this.

gatewayd.yaml (resolved)
raft/raft.go (outdated, resolved)
raft/raft.go (resolved)
raft/raft.go (resolved)
raft/raft.go (outdated, resolved)
raft/raft.go (outdated, resolved)
- Added error handling to record and log errors when Raft node initialization fails.
- Ensured the application exits with a specific error code if the Raft node cannot be started.
- Updated tests to set environment variables for Raft node configuration.
- Added a new error code for Raft node startup failure in the error definitions.

This change ensures that if the Raft node cannot be configured and started, the application will terminate gracefully, preventing further execution with an invalid state.

- Changed the raft address from 127.0.0.1:2223 to 127.0.0.1:2222.
- Updated the nodeID from node2 to node1.

These updates are made to the test data configuration to align with the current test case requirements.

The comment above the constants was misleading, suggesting they were only command types. Updated the comment to reflect that these constants are related to Raft operations.
- Removed the unnecessary `isLeader` variable in the `monitorLeadership` function.
- Directly checked the node's state against `raft.Leader` in the if condition.
Updated the `Shutdown` method in `raft.go` to gracefully handle the `ErrRaftShutdown` error. This change ensures that if the Raft node is already shut down, the error is ignored, preventing unnecessary error handling.
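The idea of tolerating an already-stopped node can be sketched as below. This is an assumption-laden illustration: `ErrRaftShutdown` here is a local sentinel standing in for the error of the same name in the Raft library, and `shutdownNode` with its `stop` callback is a hypothetical helper, not the method in `raft.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRaftShutdown is a local stand-in for the library's sentinel error.
var ErrRaftShutdown = errors.New("raft is already shutdown")

// shutdownNode runs the stop callback and treats "already shut down" as
// success, so calling it more than once is harmless. Any other error is
// wrapped and returned to the caller.
func shutdownNode(stop func() error) error {
	err := stop()
	if err == nil || errors.Is(err, ErrRaftShutdown) {
		return nil
	}
	return fmt.Errorf("shutdown raft node: %w", err)
}
```

Using `errors.Is` rather than a direct comparison keeps the check working even if the sentinel is wrapped somewhere along the call chain.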
Member

@mostafa mostafa left a comment

Thanks for your awesome contribution! 🙏 🙌

LGTM! 🚀 🎉

@mostafa mostafa merged commit c94c475 into main Dec 14, 2024
5 checks passed
@mostafa mostafa deleted the feature/setup-raft branch December 14, 2024 11:25